Diagnostic accuracy of three computer-aided detection systems for detecting pulmonary tuberculosis on chest radiography when used for screening: Analysis of an international, multicenter migrants screening study

The aim of this study was to independently evaluate the diagnostic accuracy of three artificial intelligence (AI)-based computer-aided detection (CAD) systems for detecting pulmonary tuberculosis (TB) on chest x-ray (CXR) images from global migrant screening, compared against both microbiological and radiological reference standards (MRS and RadRS, respectively). Retrospective clinical data and CXR images were collected from the International Organization for Migration (IOM) pre-migration health assessment TB screening global database for US-bound migrants. A total of 2,812 participants were included in the dataset used for analysis against RadRS, of whom 1,769 (62.9%) had accompanying microbiological test results and were included in the analysis against MRS. All CXRs were interpreted by three CAD systems (CAD4TB v6, Lunit INSIGHT v4.9.0, and qXR v2) in an offline setting and re-interpreted by two expert radiologists in a blinded fashion. Performance was evaluated using receiver operating characteristic (ROC) curves and estimates of sensitivity and specificity at different CAD thresholds against both MRS and RadRS, and was compared with that of the expert radiologists. The area under the curve against MRS was highest for Lunit (0.85; 95% CI 0.83−0.87), followed by qXR (0.75; 95% CI 0.72−0.77) and then CAD4TB (0.71; 95% CI 0.68−0.73). At a set specificity of 70%, Lunit had the highest sensitivity (81.4%; 95% CI 77.9–84.6); at a set sensitivity of 90%, specificity was also highest for Lunit (54.5%; 95% CI 51.7–57.3). The CAD systems performed comparably to the expert radiologists in sensitivity (98.3%) and, except for CAD4TB, in specificity (13.7%). Similar trends were observed when using RadRS. The area under the curve against RadRS was highest for CAD4TB (0.87; 95% CI 0.86–0.89) and Lunit (0.87; 95% CI 0.85–0.88), followed by qXR (0.81; 95% CI 0.80–0.83). At a set specificity of 70%, CAD4TB had the highest sensitivity (84.1%; 95% CI 82.3−85.8), followed by Lunit (80.9%; 95% CI 78.9−82.7); at a set sensitivity of 90%, specificity was also highest for CAD4TB (54.6%; 95% CI 51.3−57.8). In conclusion, the study demonstrated that the three CAD systems had broadly similar diagnostic accuracy with regard to TB screening and comparable accuracy to an expert radiologist against MRS. Performance differed by reference standard: Lunit performed better than both qXR and CAD4TB against MRS, and CAD4TB and Lunit performed better than qXR against RadRS. Moreover, the performance of the CADs can be affected by the characteristics of population subgroups. The main limitation was that our study relied on retrospective data, and microbiological testing was not routinely done in individuals with a low suspicion of TB and a normal CXR. Our findings suggest that CAD systems could be a useful tool for TB screening programs in remote, high TB prevalence settings where access to expert radiologists may be limited. However, further large-scale prospective studies are needed to address outstanding questions around the operational performance and technical requirements of the CAD systems.


Introduction
Plain chest radiography remains a crucial tool for early detection of pulmonary tuberculosis (TB) and for monitoring responses to TB treatment [1]. Chest X-rays (CXRs) have a high sensitivity in detecting pulmonary TB abnormalities, even in asymptomatic TB patients, especially when interpreted by experienced radiologists. Despite this, out of the estimated 10 million global TB cases in 2019, only 7.1 million were detected and reported [2]. As a result, although the global TB incidence rate and annual number of TB deaths have been steadily declining, they are not yet in line with the targets set out in the World Health Organization's (WHO) End TB Strategy [2].
While advances in digital radiography technology have increased the quality of CXR images [3], limited access to these facilities and to experienced radiologists remains a long-standing challenge, particularly in low-resource settings with a high TB burden [4]. However, recent advances in artificial intelligence (AI)-based computer-aided detection (CAD) systems have shown promising results in the automated interpretation of CXRs and detection of TB [5][6][7]. With acceptable accuracy, these CAD systems may help improve access to CXR reading for TB screening and contribute towards achieving WHO's End TB Strategy [2,8]. However, there are a limited number of studies in this area, most of which have methodological limitations, studied only one CAD software product, included few screening data, and/or were industry-funded [7][8][9][10]. Moreover, most studies used non-expert CXR interpreters, assessed an online CAD processing system or shared images with the CAD vendors, and compared performance against a suboptimal reference standard of a single sputum specimen tested with Xpert MTB/RIF, which further highlights the need for independent and rigorous studies [8][9][10][11][12]. More recent investigations have focused on offline and multiple AI systems [13][14][15], but they remain few in number.
A global consultation, convened by WHO in 2016, concluded that additional evidence on the performance and use of available CAD systems for TB screening was required [16]. To address this need, the International Organization for Migration (IOM) and FIND entered into a research collaboration to conduct two parallel studies at their respective organizations, both evaluating the accuracy of TB CAD technologies. The studies were conducted independently of the developers, using similar study designs and analysis plans, but involving separate  [17].

PLOS GLOBAL PUBLIC HEALTH
Here we present the results of the IOM study, which evaluated the diagnostic accuracy of three commercially available CAD systems for detecting TB in an offline setting, against a microbiological (MRS) and a radiological reference standard (RadRS), using an independent global archive of CXR images collected from multiple sites performing TB screening of migrants in different countries.

Ethics statement
The study protocol received ethical approval from the McGill University Health Centre (MUHC) Research Ethics Board (REB) (project approval number 2019-4649). The study also received IOM legal counsel approval to use the retrospective data and chest x-ray images of participants. In addition, CDC approval was obtained for use of data from the pre-migration health assessment program of migrants bound for the U.S.A. Informed consent was waived by the reviewing institutions since it was not feasible to locate the participants, who had already resettled in the receiving country by the time the study was conducted.

Study design
The manufacturer-independent archive of CXR images, set up at the IOM Global Teleradiology and Quality Control Center in Manila, consisted of retrospective clinical data and DICOM CXR images from pre-migration health assessment TB screenings of migrants bound for the US. These screenings were conducted across 31 IOM migrant health assessment centers (MHACs) in 18 different countries between October 2014 and December 2017. The distribution of countries is summarized in S1 Table for the MRS analysis population and in S2 Table for the RadRS analysis population. For this study, all CXRs were analyzed by two experienced radiologists, as well as by the three CAD systems.

Study participants and screening assessments
The IOM Migration Health Division conducts pre-migration TB screening of refugees and immigrants bound for different resettlement countries through its MHACs located in different countries worldwide. IOM uses a global web-based application, the Migrant Management Operation System Application (MiMOSA), to record migrants' clinical information during the screening, and local and global picture archiving and communication systems (PACS) to archive the CXR images.
The TB screening of US-bound migrants is conducted in accordance with the US Centers for Disease Control and Prevention (CDC) Technical Instructions [18], which include a clinical history, physical examination, and CXR examination (interpreted by a qualified radiologist) using the standardized CXR reporting template provided by the US Department of State, DS-3030. Additionally, if the CXR reading is suggestive of TB or there is a clinical suspicion of TB, three consecutive sputum smear tests, plus solid and liquid culture tests, are completed. Molecular diagnostic tests such as Xpert are also performed if fast results are required or if there is a suspicion of drug resistance. Participants eligible for inclusion in this study were 15 years or older, with a TB screening CXR for which the initial CXR interpretation and reference standards were available. No images included in this study had ever been shared with any of the CAD system manufacturers.

Sample size and sampling
The sample size was calculated to demonstrate minimum CAD system accuracy targets of 90% sensitivity and 70% specificity, based on the WHO target product profile (TPP) for a TB triage test [19]. The minimum sample size required to detect these sensitivity and specificity targets with 90% power and a 95% confidence interval (CI) of 10% or less was 536 TB and 789 non-TB cases. These numbers were increased by 10% to account for missing information, resulting in a final target population of 590 TB and 868 non-TB cases.
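The 10% inflation step described above can be checked with a short sketch. The base sizes (536 and 789) are taken from the text; the function name and the choice to round up to a whole participant are assumptions made here for illustration, and the calculation that produced the base sizes themselves is not reproduced.

```python
def inflate_sample_size(base_n: int, inflation_pct: int = 10) -> int:
    """Increase a sample size by a percentage, rounding up to a whole participant.

    Hypothetical helper: rounding up is an assumption, not stated in the text.
    """
    # Ceiling division on integers avoids floating-point rounding surprises.
    return -(-base_n * (100 + inflation_pct) // 100)


tb_target = inflate_sample_size(536)      # minimum TB cases from the text
non_tb_target = inflate_sample_size(789)  # minimum non-TB cases from the text
print(tb_target, non_tb_target)
```

Under these assumptions the sketch yields the reported targets of 590 TB and 868 non-TB cases.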
Two samples were drawn from the screening archive (Sample 1 and Sample 2), one for each reference standard. Records were extracted from the MHACs with the highest caseloads first, until the sample size targets had been met.

Data preparations
Biographic information, clinical and laboratory results, and original radiology readings were extracted from the IOM electronic migrant management application (the MiMOSA global database) and anonymized before being entered in the study dataset. DICOM CXR images of all participants were collected from each MHAC's local PACS or from the global PACS, as required, and were also anonymized before further use in the study. Clinical and DICOM data were merged into one dataset using a unique participant identifier.
All three CAD systems read posterior-anterior (PA) CXR DICOM images and provide an abnormality score ranging from 0-100 (CAD4TB) or 0-1 (Lunit and qXR). A secondary image with a heatmap (CAD4TB and Lunit) or bounding boxes (qXR) is also produced that indicates the location of the identified abnormal findings (S1 Fig; S1A-S1D Fig). Lunit and qXR have manufacturer-recommended thresholds for TB, while CAD4TB users are required to determine the threshold via a verification process (using data from the user site). Three threshold scores were provided by Lunit, either favoring high sensitivity (score = 0.15), high specificity (score = 0.45) or a middle threshold (score = 0.3). Two thresholds were provided by qXR: a "routine TB screening threshold score" of 0.55 and a "high-risk TB threshold score" of 0.75.
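As a minimal sketch of how the score ranges and manufacturer-provided thresholds above might be applied, the snippet below rescales scores to a common 0-1 range and flags a CXR as CAD-positive. The dictionary layout, the function names, and the use of "greater than or equal to" at the threshold are assumptions for illustration; they are not part of any vendor's interface. CAD4TB has no entry in the threshold table because its threshold is site-determined.

```python
# Score ranges and thresholds as reported in the text; everything else
# (names, structure, the >= comparison) is a hypothetical illustration.
SCORE_MAX = {"CAD4TB": 100.0, "Lunit": 1.0, "qXR": 1.0}

THRESHOLDS = {
    "Lunit": {"high_sensitivity": 0.15, "middle": 0.30, "high_specificity": 0.45},
    "qXR": {"routine_screening": 0.55, "high_risk": 0.75},
}


def normalize(system: str, score: float) -> float:
    """Rescale an abnormality score to the 0-1 range for side-by-side plots."""
    return score / SCORE_MAX[system]


def is_cad_positive(system: str, score: float, threshold_name: str) -> bool:
    """Flag a CXR as CAD-positive at a named manufacturer threshold."""
    return score >= THRESHOLDS[system][threshold_name]


print(normalize("CAD4TB", 50))                          # 0.5
print(is_cad_positive("Lunit", 0.2, "high_sensitivity"))  # True
print(is_cad_positive("Lunit", 0.2, "middle"))            # False
```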
For CAD4TB and qXR, a verification or test run was conducted on sets of CXRs from 13 different X-ray machine models used by IOM, which did not form part of this study, as required by the manufacturers at that time. Out of 13 tested, 10 X-ray machine models passed the verification for CAD4TB, and CXRs from those machines were included in the study (Agfa CR10-X, Agfa CR15-X, Agfa CR30-X, CareStream CR975, CareStream DRX-1, CareStream VitaCR, DRGEM, FUJIFILM, Kodak Point Of Care 260, SHIMADZU and SIEMENS). CAD CXR interpretation of DICOM images from study participants was carried out by IOM as per the manufacturers' instructions, using offline server-installed CAD licenses. Only the PA CXR of the initial health assessment for each participant was used for the CAD interpretation, even if some participants had additional CXR views and follow-up CXRs.

Reference standards (microbiological [MRS] and radiological [RadRS]).
For MRS analyses, a TB case was defined as a positive result on at least one out of three sputum cultures collected on consecutive days during the initial screening assessment. A non-TB case was defined as: 1) a negative result for all three sputum cultures, 2) the identification of non-tuberculous mycobacterium; and/or 3) at least one negative sputum culture result if the rest of the samples were contaminated. Only results from specimens taken within 14 days of the CXR were included. In the few cases where Xpert analyses were conducted, a positive Xpert result was interpreted as a positive MRS, even if the culture result was negative.
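The MRS case definition above can be sketched as a small decision function. The string encoding of culture results, the function name, and the order in which the rules are checked (a positive result takes precedence over a non-tuberculous mycobacterium finding) are assumptions made here; the sketch also assumes that specimens outside the 14-day window have already been filtered out.

```python
from typing import List, Optional


def classify_mrs(cultures: List[str], xpert_positive: bool = False) -> Optional[bool]:
    """Return True (TB), False (non-TB) or None (cannot be classified).

    cultures: one entry per sputum culture, encoded (hypothetically) as
    "positive", "negative", "ntm" or "contaminated".
    """
    # A positive Xpert counts as MRS-positive even if cultures are negative.
    if xpert_positive or "positive" in cultures:
        return True
    # Identification of non-tuberculous mycobacterium is treated as non-TB.
    if "ntm" in cultures:
        return False
    # All negative, or at least one negative with the rest contaminated.
    if "negative" in cultures and all(
        c in ("negative", "contaminated") for c in cultures
    ):
        return False
    return None


print(classify_mrs(["negative", "positive", "negative"]))
print(classify_mrs(["contaminated", "contaminated", "negative"]))
print(classify_mrs(["contaminated", "contaminated", "contaminated"]))
```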
For RadRS analyses, all CXRs were analyzed by two certified IOM consultant radiologists with 10 years of experience in TB screening, who had received regular training specific to TB screening CXR interpretation and who performed best in IOM's regular internal monitoring and evaluation program, which uses key performance indicators.
The radiologists were blinded to the clinical and original CXR findings, as well as to the CAD results. Each specialist assessed half of the CXRs using the DS-3030 CXR reporting template containing a specified list of TB and non-TB findings (S3 Table). This re-assessment of the CXRs was conducted to reduce inter-reader variability and standardize the readings, as the original CXR interpretations were performed by several radiologists at different MHACs. If the image quality was not deemed to be acceptable, or if additional CXR views would have been required to complete the interpretation, the radiologists could exclude these CXRs from the analysis. When the new CXR readings showed major discrepancies with the original readings, CXR images were reviewed by a quality control radiologist, who provided a final reading after review of all interpretations from all sources. For RadRS, a TB case was defined as a CXR that was suggestive of active TB disease or old, healed TB (categories 2 and 3 of the CXR classification form; S3 Table). A non-TB case was defined as a normal CXR or one which showed other non-TB findings (categories 1, 4, 5, and 6; S3 Table).
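The RadRS definition reduces to a mapping from the CXR classification category to a binary label, which can be sketched as follows. The category numbers are those given in the text (S3 Table); the names are hypothetical.

```python
# Categories from the CXR classification form as described in the text.
TB_CATEGORIES = {2, 3}             # suggestive of active TB, or old healed TB
NON_TB_CATEGORIES = {1, 4, 5, 6}   # normal CXR, or other non-TB findings


def classify_radrs(category: int) -> bool:
    """Return True for a RadRS TB case, False for a RadRS non-TB case."""
    if category in TB_CATEGORIES:
        return True
    if category in NON_TB_CATEGORIES:
        return False
    raise ValueError(f"Unknown CXR category: {category}")


print([classify_radrs(c) for c in range(1, 7)])
```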

Data analysis
Clinical data, CXR readings, and CAD scores were collated into one dataset and any duplicates identified were excluded prior to analysis. Histograms of the CAD abnormality scores were plotted, receiver operating characteristic (ROC) curves calculated and the area under the curve (AUC) evaluated for each CAD system against both reference standards, using binomial distribution assumptions.
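For readers unfamiliar with the AUC, the sketch below computes it directly from its rank-based definition: the probability that a randomly chosen TB case receives a higher CAD score than a randomly chosen non-TB case, with ties counted as one half. This is an illustration in plain Python on made-up scores, not the Stata procedure used in the study, and it does not compute confidence intervals.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    scores: CAD abnormality scores; labels: truthy for TB, falsy for non-TB.
    Quadratic in the number of cases, which is fine for an illustration.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("Need at least one TB and one non-TB case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a concordant pair
    return wins / (len(pos) * len(neg))


# Made-up example: 3 TB cases and 2 non-TB cases.
example_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
example_labels = [1, 1, 0, 1, 0]
print(auc(example_scores, example_labels))
```

In this toy example five of the six TB/non-TB pairs are ordered correctly, so the AUC is 5/6.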
Estimates of sensitivity and specificity were also calculated at: 1) predefined points for sensitivity or specificity based on WHO triage TPP (90% sensitivity and 70% specificity); and 2) manufacturer-provided CAD score thresholds (only for Lunit and qXR) against MRS and RadRS. The sensitivity and specificity of radiologist assessments were also calculated against MRS. Finally, the sensitivity and specificity of each of the CAD systems against MRS were calculated at the threshold that produced the same specificity or sensitivity achieved by the radiologists.
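Reading sensitivity at a fixed specificity can be sketched as a threshold search: find the lowest score threshold whose specificity meets the target, then report the sensitivity there. The function names, the made-up data, and the convention that a score at or above the threshold is CAD-positive are assumptions for illustration; the sketch assumes both TB and non-TB cases are present.

```python
def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when score >= threshold is CAD-positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if not y and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def sensitivity_at_specificity(scores, labels, target_spec):
    """Return (threshold, sensitivity, specificity) at the lowest observed
    score threshold whose specificity reaches target_spec, or None."""
    # Specificity never decreases as the threshold rises, so the first
    # threshold that meets the target also gives the highest sensitivity.
    for t in sorted(set(scores)):
        sens, spec = sens_spec(scores, labels, t)
        if spec >= target_spec:
            return t, sens, spec
    return None


# Made-up data: 4 TB cases (label 1) and 6 non-TB cases (label 0).
demo_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
demo_labels = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1]
threshold, sens, spec = sensitivity_at_specificity(demo_scores, demo_labels, 0.70)
print(threshold, sens, spec)
```

The same search with the roles of sensitivity and specificity swapped gives specificity at a set sensitivity of 90%.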
Subgroup analyses for AUC, sensitivity and specificity were conducted for the following groups: age (15−35 years, 36−55 years, 56+ years), sex, geographical region, high-risk groups (e.g., a history of previous TB), migrant type (refugee vs immigrant), HIV status (if known), presence of TB symptoms, sputum smear status, presence of some image quality issues even if the images were deemed acceptable overall, and the presence of additional CXR views obtained during the screening and re-assessed by expert radiologists. Stata software version 16 was used for data management and analysis [20].

Study selection
A total of 2,910 cases were sampled (Fig 1): 589 culture-positive and 865 culture-negative from Sample 1, and 590 CXR suggestive of TB and 866 CXR not suggestive of TB from Sample 2.

Baseline demographic and clinical characteristics
Baseline demographic and clinical characteristics of the whole study population (RadRS) and the population included in the analysis against MRS are presented in S4 Table and Table 1, respectively. Similarly, CXR findings and microbiological test results among RadRS and MRS population are presented in S5 Table and Table 2, respectively.
In the MRS analysis population, more than half (60.5%) were male, many were young (36.2% were 15−35 years of age), 30.1% were MRS positive, and 89.9% had a CXR suggestive of TB (RadRS positive). MRS-positive TB cases were reported more often among males (67.0%), at a younger age (44.5% were 15−35 years of age), and in those with TB symptoms compared to non-TB cases (3.6% vs 0.6%). Sputum smears were positive in 29.8% of MRS-positive cases, and abnormal CXR findings were present in 99%, 97.5% of which were CXR findings suggestive of TB (Table 1). However, 91% of the MRS-negative cases also had abnormal CXR findings, 84.6% of which were CXR findings suggestive of TB. Only 4.7% of MRS TB cases had Xpert results in addition to cultures, and only two (0.1%) had discrepancies, one being culture-negative and Xpert-positive, and the other being culture-positive and Xpert-negative (Table 2).
In the RadRS analysis population (n = 2,812), similarly, most were male (55.3%) and in the younger age group (45.4%), but only 1.2% had one or more TB symptoms and 10% had a smear-positive result (S4 Table). Overall, 63.3% of the RadRS analysis population had CXR findings suggestive of TB (RadRS-positive cases). Of those, 32.3% were culture positive (S5 Table).

Histogram distribution of index tests
Abnormality scores of all three CAD systems showed some bimodal distribution when plotted in a two-way histogram against MRS, with a wide range of overlap between TB and non-TB cases (
The point estimate for the sensitivity (98.3%; 95% CI 96.8−99.2) and specificity (13.7%; 95% CI 11.8−15.7) of the expert radiologist is plotted in the ROC against MRS for visual comparison of the radiologist's performance with that of the CADs (Fig 2A); it lies on the curve for Lunit.

Image processing errors were noticed for 178 CXR images after processing by Lunit, in which the images were inverted from the original negative image (bones white) to positive (bones black). The sensitivity and specificity values, as well as the AUC of Lunit with and without those cases included, were unaffected by these processing errors.

Diagnostic accuracy of index tests in different population subgroups
The diagnostic accuracy, expressed in terms of the AUC of the ROC curve for all three CAD systems, was lower in cases with a history of pulmonary TB compared to those without a history of TB. For CAD4TB, it was also lower in smear-negative cases (0.66; 95% CI 0.63−0.69) compared to smear-positive cases (0.82; 95% CI 0.70−0.94), and in those with additional views (0.58; 95% CI 0.50−0.65) compared to those without additional views (0.72; 95% CI 0.69−0.75) (S3 Fig and S7 Table). For other subgroups such as female sex, HIV infected, absence of TB symptoms, and immigrants (and for Lunit the older age group), CAD systems appeared to show lower AUC estimates compared with their opposing subgroups, though the CIs overlapped. Other subgroups, such as image quality and region, did not show any additional trends (S3 Fig and S7 Table).

Discussion
This study is one of the first comprehensive studies evaluating CAD systems independently of the CAD developers in a population screened for TB, using both culture results and expert radiologist assessments as reference standards. The findings demonstrated that the three CAD systems (Lunit, CAD4TB, qXR) have comparable diagnostic accuracy in detecting TB on CXR when used for TB screening and may perform comparably to expert radiologists, with Lunit performing better than both qXR and CAD4TB against MRS, and CAD4TB and Lunit performing better than qXR against RadRS. However, none of the CAD systems reached the minimum performance requirements of the WHO triage TPP (90% sensitivity and 70% specificity) [19], in contrast to previously published findings by Khan et al. [13]. The finding that i) Lunit performed best against MRS and ii) CAD4TB performed best against RadRS shows that the CADs' performance can vary by the reference standard used. It may indicate that Lunit is better at detecting CXR findings suggestive of active TB disease, which tend to be culture-positive, while CAD4TB may better detect CXR findings suggestive of old, healed TB that can be identified by radiologists but tend to be culture-negative. This finding could also reflect the methodology used in the deep machine learning of the CAD product algorithms, e.g., mainly training the software against RadRS versus MRS. The better performance of CAD4TB against RadRS than MRS is also supported by the results of a previous study by Fehr et al. [11].
The low specificity of the CADs at a set sensitivity of 90% against MRS is similar to that of the expert radiologist and is likely a result of the selection of our study population, in whom sputum samples were usually collected only when the initial CXR reading was suggestive of TB or when there was a clinical suspicion of TB. However, the intrinsic nature of sputum culture, CXR, and CXR signs of TB may also be an explanation. Culture analysis detects TB in cases with detectable bacteria in the sputum. As such, it measures the sensitivity of detecting active TB disease, whereas old, healed pulmonary lesions detected by CXR can be culture-negative and are considered false positives in the analysis against MRS. Additionally, CXR signs of TB are not specific to TB, which reduces the estimated specificity. Therefore, CXR is recommended for screening but not as a confirmatory diagnostic tool (i.e., a positive CXR TB screening result should be used as a criterion for further confirmatory testing, such as sputum cultures, and not for a treatment decision). Nevertheless, for a screening tool the benefit of high sensitivity may outweigh the limitations of a lower specificity. Both Lunit and qXR had relatively lower sensitivity and specificity at all manufacturer-provided thresholds, though Lunit performed with relatively higher sensitivity while qXR achieved higher specificity. As such, the CAD thresholds that correspond to the sensitivity and specificity of the expert radiologist assessments (98.3% and 13.7%, respectively) may be potential candidates for the selection of optimal thresholds for operational use.
Subgroup analyses showed that the performance of CADs can vary among some population demographic and clinical characteristics. All CAD systems performed worse in participants with a history of TB, something which has also been observed in previous studies [13][14][15]. This is to be expected, as healed TB can leave residual CXR changes, which usually are classified as TB findings on CXR but can lead to negative microbiological test results. CAD4TB performed worse in participants with smear-negative results, in line with the findings of Khan et al. [13]; CAD4TB, moreover, performed worse in cases where additional CXR views were requested by the expert radiologist. Again, these results are not surprising, as smear-positive cases may have obvious CXR abnormalities that can be easily detected, and the absence of a request for additional views may indicate that the initial CXR was of good quality and/or there were no suspicious CXR findings. However, this conclusion was significant only for CAD4TB, while a similar although not statistically significant trend was observed for Lunit and qXR.
Additional trends were observed in the other subgroup analyses. While overlapping CI values indicate that these findings should be interpreted with caution, it appeared that the CAD systems performed less well in females, participants with no TB symptoms, HIV-positive participants, those with an 'immigrant' status compared with those classified with a 'refugee' status, and in older participants (for Lunit only). Other studies have also reported that CAD performance can be significantly impacted by sex and age [13]. The differences in CAD performance among subgroups indicate that population characteristics should be taken into consideration before implementation.
There are some limitations to this study that should be considered. Firstly, our study relied on retrospective data from a routine migration screening program. As such, participants received sputum smear and culture tests during the initial TB screening only when the initial CXR reading was suggestive of TB or there was a clinical suspicion of TB. Therefore, sputum culture testing was not performed for most participants with normal CXRs or CXR findings suggestive of non-TB, and these participants were not included in the MRS analysis. This likely resulted in spuriously higher sensitivity and lower specificity estimates for the CADs and expert radiologists than would be expected from an unselected population. Moreover, as TB cases were overrepresented in both the MRS and RadRS analyses due to the sampling strategy employed in this study, the dataset may not be representative of all people presenting for TB screening but is instead a subset of those who had a higher suspicion of TB and therefore underwent sputum examination. This might have affected the overall accuracy estimates, albeit to a similar extent for all three CAD systems and the expert radiologists; thus, we believe the comparison between the accuracy of CADs and expert radiologists holds true. Additionally, 20 participants with no radiologist assessment were excluded from the analyses, as well as 207 images from the CAD4TB analysis due to invalid score results. The reason for the invalid scores with the CAD4TB system was unknown, but it could be that the software's quality control rejected unacceptable or poor images without requiring further investigation. Although the number of excluded results is small compared with the size of the whole dataset, it is possible that the characteristics of the excluded cases differed from those of the cases that were included.
Although the study did not systematically evaluate quality control measures of the CADs, some issues were observed during the automated interpretation of the CXRs by the CAD systems. Some CXR projections other than the PA CXRs, such as lateral and lordotic CXR views (which are unsupported by the CADs) or CXRs with image quality issues, were not always flagged by the systems.
The study also did not assess the operational performance of the CADs, such as processing time, technical issues and troubleshooting responses, infrastructure needs, comparison of offline and online use of the CAD product, cost-effectiveness, or other related matters. In addition, since the study was conducted, new versions of the CAD systems have been released and other CAD systems have entered the market [21], which may necessitate further evaluation.
Based on the findings of this study, combined with those of the parallel study conducted by FIND [22], CAD systems may be considered a viable tool for automated CXR interpretation for TB detection in screening programs, particularly in remote and/or high TB burden settings with limited resources and limited access to expert radiologists. The use of CAD systems in these areas may even have a wider application and contribute to increasing the global TB detection rate. Following these and other findings, WHO has recently released consolidated guidelines on tuberculosis recommending that CAD may be used in place of human readers for interpreting digital CXR for TB screening in individuals aged 15 years and older [17]. Another role of CADs, even in places where expert radiologists are available, may be internal quality control monitoring of CXRs, complementary to radiologist assessments.
Nevertheless, further studies may be required to investigate the accuracy of CADs in detecting significant non-TB findings, such as lung cancer or bone lesions, as well as the different specific CXR findings suggestive of TB; to better address the performance of CADs in the different population subgroups; and to examine the way the CADs address image quality issues that might necessitate repeat CXRs or additional views by radiologists, how the CADs handle non-PA CXRs, and how they handle cases that do not comply with the age requirements of specific systems. Likewise, prospective studies are needed to address the operational use of the CAD systems [23], including choice of the CAD system and version, compatibility with X-ray machines, accepted image formats, need for validation, integration into existing workflows and patient registration systems, feasibility of online or offline use of the software, and technical requirements, as well as the selection of optimal thresholds for the intended use.
In conclusion, the results of this study demonstrated that the three CAD systems have comparable accuracy for CXR interpretation in TB screening and may perform broadly similarly to an expert radiologist. Additionally, the study demonstrated that the performance of the CAD systems can vary by population demographic and clinical characteristics as well as by the reference standard used. As such, these tools may provide viable options for use in TB screening programs to increase TB detection, especially in low-resource areas where there may be no available expert radiologists. However, further studies are needed to better address CAD performance in specific population subgroups or for different CXR TB findings, to assess other operational and technical factors necessary for proper operational implementation, and to evaluate novel CAD products coming to the market.
Supporting information S1